Skip to content
This repository has been archived by the owner on Dec 3, 2020. It is now read-only.

Fix #29: Attempt extraction after parsing is finished, before loading. #143

Merged
merged 3 commits into from
Oct 10, 2018

Conversation

Osmose
Copy link
Contributor

@Osmose Osmose commented Oct 1, 2018

I tested running extraction on document_start and universally found that extraction wouldn't turn up anything useful (surprise, you need a fully-parsed DOM to meaningfully extract anything). document_end is pretty consistently returning good product info, which saves us having to wait for every resource to load. As far as I can tell there's not much opportunity to start work earlier than this.

I also couldn't help myself and did a minor refactor of the extraction module to be more in line with how the other code modules are organized. It's mostly renames.

@Osmose Osmose requested a review from biancadanforth October 1, 2018 18:44
@Osmose Osmose force-pushed the robust-extraction branch from d0e07cc to f8ed9d3 Compare October 2, 2018 23:46
Copy link
Collaborator

@biancadanforth biancadanforth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks Osmose. The refactor makes a lot of sense to me and seems to be working just fine. I tried this out on 7 random product pages, and I found that 6 of 7 extracted successfully at this earlier step.

I suggest two changes here, one being very important: Only attemptExtraction a second time if the first time fails; as-is, we're running it twice on every page no matter what.

date: (new Date()).toISOString(),
},
});
if (extractedProduct) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fallback extraction does not currently return null if it doesn’t find anything, so this would be a good time to change that before adding this existence check. (My bad)

We do also check for correct data types via propTypes further downstream, but it'd be nice to be consistent between the two extraction methods.

if (document.readyState === 'complete') {
// Extract immediately, and again if the readyState changes.
attemptExtraction();
document.addEventListener('readystatechange', () => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably only want to add this listener and attempt extraction again if the first attempt failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to perform a second extraction in case JavaScript has modified the page and changed the product info, to ensure we get the correct product info every time. I'm not sure how common this is, though, or how performing two extractions affects the user's experience.

I think we should keep it, but can be convinced otherwise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah hmm... so see if we can get something at all to put in the popup sooner, and then update with the second round of extraction...

Perhaps we could add a scalar probe that increments if the results of the first to the second extraction are different and compare that to the number of extraction attempts? Might be a wishlist probe.

What do you think about that idea @Osmose ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wouldn't bother adding a probe like that initially.

@Osmose Osmose requested a review from biancadanforth October 8, 2018 18:03
@Osmose
Copy link
Contributor Author

Osmose commented Oct 8, 2018

I refactored extraction to return null in cases where the selectors or tags for all features aren't found. I also split open graph and selector-based extraction into two separate modules. In the future, if we add optional extraction features, we'll need to modify these, but that's fine.

Copy link
Collaborator

@biancadanforth biancadanforth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Osmose! Looks great.

I found a bug I actually introduced earlier in the Open Graph extraction and just filed a bug. Will fix it shortly.

return null;
}

extractedProduct[feature] = metaEle.getAttribute('content');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bug here, since the propType for extractedProduct expects a type of Number for the price value downstream. I took care of this for fallback and Fathom extraction earlier, but I neglected to add it here. I filed a follow-up issue #154 . I'll take care of that right now.

Michael Kelly added 3 commits October 10, 2018 10:27
… found.

Open Graph and selector-based extraction are now separate forms of extraction
instead of a single, "fallback" extraction method.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants